Group M (Mohamed Boutaleb, Francesco Moro)

Machine Learning Project - Diamond Analysis

Introduction

In this notebook will be performed the analysis of the diamond dataset, a collection of almost 54,000 diamonds.
To realize interesting analysis and predictions about the considered dataset, we decided to maintain as target variable the column named "price". In this way it is possible to predict it through different methods, allowing us to compare the results obtained.

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:
In [2]:
### Keeping safe from any disaster :) ###
%autosave 30
Autosaving every 30 seconds

4C's Diamond

drawing

Image source: Diamond Education Learning

In a nutshell, to understand how much a diamond is worth, we should start with the basics, i.e. the 4 C's (CARAT, CLARITY, COLOR, CUT). These explain very well the quality of a diamond:

  • Carat: Carat is the weight of the diamond. Bigger the diamond, rare to find them.
  • Clarity: Clarity of the diamond is intrusions in the diamond, which could be internal flaws, external defects.
  • Color: Color of the diamond is self explinary, and on scale of AtoZ, A is best and colorless, where as Z is worst and yellowish color. The colorless diamonds are rare and costly.
  • Cut: Cut of the diamond is about physical dimensions of the diamond. It determine the light amount enter into the diamond. Anotomy of the diamond is an important parameter.

Cut and Components

drawing drawing

There are several things to consider to see in a diamond. Few such is table%, depth% and cuts. Diamond anotomy is given it the cut of the diamond infographic in the above image.

All the basic information about the diamond cut are given above. Side view of the diamond show the terminology and meaning of his features. Thus table length, table width, diamond depth, depth%, and table% are very important features to determine the price of a diamond.

Libraries used

For these analyses we used some libraries covered during the course and partly others based on their documentation. Below is the list:

  • numpy
  • pandas
  • plotly
  • scikit learn
  • ml_utilities
  • ml_visualization

  • seaborn

  • matplotlib
In [3]:
### Imports ###
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly
import plotly.offline as py
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

from IPython.display import Markdown, display


from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# importing the modules
from IPython.display import display

from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

import seaborn as sns
import matplotlib.pyplot as plt

import scipy.stats as st

import ml_utilities
import ml_visualization

Reading the dataset

In order to properly import the dataset you need to:

  • Download the zipper from here
  • Extract it
  • Move it into the current directory ./.
In [4]:
### Reading .csv dataset ###
dfName = "diamonds.csv"
df = pd.read_csv(dfName)

Preliminary analysis

What are the columns?

In [5]:
### Describing the df characteristics ###
df.describe()
Out[5]:
Unnamed: 0 carat depth table price x y z
count 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000
mean 26970.500000 0.797940 61.749405 57.457184 3932.799722 5.731157 5.734526 3.538734
std 15571.281097 0.474011 1.432621 2.234491 3989.439738 1.121761 1.142135 0.705699
min 1.000000 0.200000 43.000000 43.000000 326.000000 0.000000 0.000000 0.000000
25% 13485.750000 0.400000 61.000000 56.000000 950.000000 4.710000 4.720000 2.910000
50% 26970.500000 0.700000 61.800000 57.000000 2401.000000 5.700000 5.710000 3.530000
75% 40455.250000 1.040000 62.500000 59.000000 5324.250000 6.540000 6.540000 4.040000
max 53940.000000 5.010000 79.000000 95.000000 18823.000000 10.740000 58.900000 31.800000
In [6]:
### How many non-null fields are there? ###
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  53940 non-null  int64  
 1   carat       53940 non-null  float64
 2   cut         53940 non-null  object 
 3   color       53940 non-null  object 
 4   clarity     53940 non-null  object 
 5   depth       53940 non-null  float64
 6   table       53940 non-null  float64
 7   price       53940 non-null  int64  
 8   x           53940 non-null  float64
 9   y           53940 non-null  float64
 10  z           53940 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB
In [7]:
df.head()
Out[7]:
Unnamed: 0 carat cut color clarity depth table price x y z
0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75

Dropping and remapping columns

In [8]:
### Dropping unnecessary column. Additional index ###
df.drop('Unnamed: 0', axis=1, inplace=True)

df_mapped=df.copy()

### Remapping all necessary columns
df_mapped['color'] = df['color'].replace(['J','I','H','G','F','E','D'],[1,2,3,4,5,6,7])
df_mapped['cut'] = df['cut'].replace(['Fair','Good','Very Good','Premium','Ideal'],[1,2,3,4,5])
df_mapped['clarity'] = df['clarity'].replace(['I3','I2','I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF'],[1,2,3,4,5,6,7,8,9,10])
In [9]:
#Functino to print in Markdown
def printmd(string):
    display(Markdown(string))

First question we asked ourselves

Is there a correlation between price and carats?

Clearly we notice immediately that the pair of carat-price features are directly proportional: as the carats increase we have confirmation that the price increases easily.

In [10]:
# # Scatter plot of the carat and price
# fig=make_subplots(rows=1, cols=1)
# fig.add_trace(go.Scatter(x=df_mapped["carat"], y=df_mapped["price"], mode="markers", name="Diamonds price basing on carats"))

# fig.update_layout(
#     title="Diamonds price basing on carats",
#     xaxis_title="Carat",
#     yaxis_title="Price",
#     legend_title="Legend Title",
# )

plt.plot(df_mapped["carat"], df_mapped["price"], 'o', color='blue');

How do the other attributes of the diamond infer price?

In this plot we can see an alternative and colorful representation of how each feature infer the price.
It is useful observe the legend on the right side that will explain how interpret each element in the plot.

In [11]:
# showing how the feature could infer price
plt.figure(figsize=(16,8))
sns.scatterplot(data=df, x="carat", y="price", hue="color",size="cut", style="clarity", sizes=(10,250))
plt.title("Influence of attributes over price")
plt.show()

How do price quantiles change according to categorical attributes?

In the following set of charts, there will be considered how the categorical attributes infer price variable. It is possible to get an alternative view of how the price varies for each categorical variable. What we expected was that the quantiles of price would increase as the quality of the categorical variables increased. And indeed despite the numerous outliers this pattern is easily identifiable from the boxplots.

In [12]:
categoricalAttributes=["clarity", "cut", "color"]
targetVariable="price"

for attribute in categoricalAttributes:
    fig = px.box(data_frame=df_mapped,orientation="h",y=attribute,x="price",title=f"{targetVariable} related to {attribute}")
    fig.show()

What is the distribution of the diamonds in this Dataset according to attributes?

This set of plot represents the relation between each feature with the considered target. Clearly, the pairs clarity-price, color-price and cut-price will consider a discrete dimensional space due to its categorical nature. The other ones will represent the distribution of price in a continuous context.

In [13]:
df_mapped.hist(bins=50, figsize=(20,15))
plt.suptitle("Features distributions")
Out[13]:
Text(0.5, 0.98, 'Features distributions')

Linear Regression

As said before we will make analyses where the target price variable will be the focus point. Since this says a lot about the diamond. Let's do a pre-analysis to see from the correlation value which features are most closely correlated with price.

Computing correlation matrix to detect some linear patterns

In [14]:
### calculation of the correlation coefficient between price and all other fields ###
corr_matrix = df_mapped.corr()

print(corr_matrix["price"].sort_values(ascending = False))
price      1.000000
carat      0.921591
x          0.884435
y          0.865421
z          0.861249
table      0.127134
depth     -0.010647
cut       -0.053491
clarity   -0.146800
color     -0.172511
Name: price, dtype: float64

As can be seen from the graphs and the data (R2), we can notice that the correlation between carats and price is linear, the same with the x, y and z features. What tend to have a normal distribution are the features depth and table.

To make the computation time of the code more optimal we made a sample of the dataset to make the plots of each linear regression

In [15]:
rows=1
cols=0
#split dataset in features and target variable
feature_cols = ['carat', 'table', 'x', 'y', 'z', 'depth']
# feature_cols = ['depth']
valid_portion=0.30
fig = make_subplots(rows=2, cols=3, subplot_titles=("Linear Regression for Carat feature", "Linear Regression for Table feature","Linear Regression for X feature", 
                                                   "Linear Regression for Y feature", "Linear Regression for Z feature", "Linear Regression for Depth feature"))
fig['layout']['xaxis1']['title']=feature_cols[0]
fig['layout']['xaxis2']['title']=feature_cols[1]
fig['layout']['xaxis3']['title']=feature_cols[2]
fig['layout']['xaxis4']['title']=feature_cols[3]
fig['layout']['xaxis5']['title']=feature_cols[4]
fig['layout']['xaxis6']['title']=feature_cols[5]

for i in range(1, 7):
    fig['layout']['yaxis{}'.format(i)]['title']="price"


for i in feature_cols:
    X = df_mapped[i] # Features
    y = df_mapped.price.astype('float64') # Target variable

    train_x, validation_x, train_y, validation_y = train_test_split(X, y, test_size=valid_portion, random_state=1)
    
    train_x=train_x.values.reshape(-1, 1)
    validation_x=validation_x.values.reshape(-1, 1)
    
    train_y=train_y.values.reshape(-1, 1)
    validation_y=validation_y.values.reshape(-1, 1)
    
    # Addestramento di un LinearRegression
    lin_reg = LinearRegression()
    lin_reg.fit(train_x, train_y)

    # Ottenimento delle predizioni
    train_y_predicted = lin_reg.predict(train_x)

    # Calcolo del RMSE
    rmse = np.sqrt(mean_squared_error(train_y, train_y_predicted))
    printmd(f"**Data Information on {i}**")
    print('Train RMSE: ', rmse) 

    # Ottenimento delle predizioni (validation) e calcolo RMSE
    validation_y_predicted = lin_reg.predict(validation_x)
    rmse = np.sqrt(mean_squared_error(validation_y, validation_y_predicted))
    print('Validation RMSE: ', rmse)
    print('R2 score:', lin_reg.score(validation_x, validation_y))
    
    w=lin_reg.coef_[0][0]
    b=lin_reg.intercept_
    
    y_hat = b +w*X
    
    x_line = np.array([np.min(X), np.max(X)]) # Prova a modificare gli estremi, per estrapolare la linea
    y_line = b + w*x_line
    
    if cols==3:
        cols=0
        rows=2
    
    cols=cols+1

    fig.add_trace(go.Scatter(x=pd.Series(X).sample(n=500, random_state=132), y=pd.Series(y).sample(n=500, random_state=132), mode="markers", name="data"), row=rows, col=cols)
    fig.add_trace(go.Scatter(x=pd.Series(X).sample(n=500, random_state=132), y=pd.Series(y_hat).sample(n=500, random_state=132), mode="markers", name="estimate"),row=rows, col=cols)
    fig.add_trace(go.Scatter(x=pd.Series(X).sample(n=500, random_state=132), y=pd.Series(y_hat).sample(n=500, random_state=132), mode="lines", name="regression line"),row=rows, col=cols)

#     # Disegnamo le linee verticali corrispondenti agli errori (residui)
#     for i in range(len(X)):
#         data.append(go.Scatter(x=[X[i], X[i]], y=[y[i], y_hat[i]], mode="lines",
#             showlegend=False, line=dict(color="gray", width=0.5)),)

fig.show()

Data Information on carat

Train RMSE:  1560.5276938406769
Validation RMSE:  1520.3317090552023
R2 score: 0.8493196667739158

Data Information on table

Train RMSE:  3987.536167459719
Validation RMSE:  3885.017951050689
R2 score: 0.016066098278325724

Data Information on x

Train RMSE:  1877.118093178531
Validation RMSE:  1825.6024177121258
R2 score: 0.7827337495915477

Data Information on y

Train RMSE:  2070.0363229147342
Validation RMSE:  1825.6734589316966
R2 score: 0.7827168399297663

Data Information on z

Train RMSE:  2073.91823176725
Validation RMSE:  1915.4660086333242
R2 score: 0.7608178508319458

Data Information on depth

Train RMSE:  4019.899198428712
Validation RMSE:  3916.634223668388
R2 score: -1.3571168991033389e-05

To make the idea even clearer, it may be useful to look at this heatmap of all the correlations between the various columns. We can see how the features carat, x, y and z are closely correlated with price. Although when we later go to do the Lasso we will see that in reality x is never sent to 0.

In [16]:
#Correlation with Price column
fig=plt.figure(figsize=(15,8))
sns.heatmap(df_mapped.corr(), linewidths=3, annot=True)
plt.title("Correlation matrics", size=20)
Out[16]:
Text(0.5, 1.0, 'Correlation matrics')
In [17]:
#split dataset in features and target variable
feature_cols = ['clarity', 'carat', 'cut', 'color','depth','table','x', 'y', 'z']
X = df_mapped[feature_cols] # Features
y = df_mapped.price.astype('float64') # Target variable
In [18]:
# split X and y into training and testing sets
valid_portion = 0.5

train_x, validation_x, train_y, validation_y = train_test_split(X, y, test_size=valid_portion, random_state=1)

Start with RandomForest to check efficiency

We can see that with RandomForest the RMSE would look very good as well as the R2. But as we will see later on, to check its veracity we need to do a K-Fold.

In [19]:
# Addestramento di un RandomForestRegressor
rndfor_reg = RandomForestRegressor()
rndfor_reg.fit(train_x, train_y)

# Ottenimento delle predizioni
train_y_predicted = rndfor_reg.predict(train_x)

# Calcolo del RMSE
rmse = np.sqrt(mean_squared_error(train_y, train_y_predicted))
printmd("**Regressor Information**")
print('Train RMSE: ', rmse) 

# Ottenimento delle predizioni (validation) e calcolo RMSE
validation_y_predicted = rndfor_reg.predict(validation_x)
rmse = np.sqrt(mean_squared_error(validation_y, validation_y_predicted))
print('Validation RMSE: ', rmse)
print('R2 score:', rndfor_reg.score(validation_x, validation_y))
printmd("**Information about feature importance according to the Regressor**")
for i, j in zip(feature_cols, rndfor_reg.feature_importances_):
    print(i, j)

Regressor Information

Train RMSE:  210.38427024918457
Validation RMSE:  557.3226182812326
R2 score: 0.9800762350891697

Information about feature importance according to the Regressor

clarity 0.06273029919808185
carat 0.48492800727758106
cut 0.0017636618807771317
color 0.03135778717193375
depth 0.002974174868615615
table 0.002285515641135326
x 0.006083521206886534
y 0.4022566815076644
z 0.005620351247324381

Check if the LinearRegressor can do better

Compared to the RandomForest Regressor we can see that the linear one only makes things worse. Both in terms of score and RMSE. This is to be expected.

In [20]:
# Addestramento di un LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(train_x, train_y)

# Ottenimento delle predizioni
train_y_predicted = lin_reg.predict(train_x)

# Calcolo del RMSE
rmse = np.sqrt(mean_squared_error(train_y, train_y_predicted))
print('Train RMSE: ', rmse) 

# Ottenimento delle predizioni (validation) e calcolo RMSE
validation_y_predicted = lin_reg.predict(validation_x)
rmse = np.sqrt(mean_squared_error(validation_y, validation_y_predicted))
print('Validation RMSE: ', rmse)
print('R2 score:', lin_reg.score(validation_x, validation_y))
Train RMSE:  1229.2485789721898
Validation RMSE:  1204.5442615091756
R2 score: 0.9069313108306197

Doing a K-Fold Cross-Validation to check if it affects RMSE optimisation

By doing K-Fold Cross-Validation we can see that in fact the values we previously obtained with the RandomForestRegressor were due to chance. The true RMSE is 1770.62.

In [21]:
# Ottimizzazione con Cross-Validation
reg = RandomForestRegressor() 

param_grid = [{'n_estimators': [10, 50, 100], 'max_depth': [None, 2, 3], 'random_state': [1234]}]

# Con shuffle=False si otterranno fold temporalmente ordinate
# (se i pattern non sono stati precedentemente mescolati)
experiment_gscv = GridSearchCV(reg, param_grid, \
                               cv=KFold(n_splits=4), \
                               scoring='neg_mean_squared_error')

experiment_gscv.fit(X, y)

# Print results
printmd('**Combinazioni di parametri:**')
print(experiment_gscv.cv_results_['params'])
printmd('**RMSE medio per combinazione:**')
print( np.sqrt(-experiment_gscv.cv_results_['mean_test_score']))
printmd('**Combinazione migliore:**')
print(experiment_gscv.best_params_)
printmd("**RMSE medio della combinazione migliore:**")
print('RMSE medio della combinazione migliore: %.3f' % np.sqrt(-experiment_gscv.best_score_))

Combinazioni di parametri:

[{'max_depth': None, 'n_estimators': 10, 'random_state': 1234}, {'max_depth': None, 'n_estimators': 50, 'random_state': 1234}, {'max_depth': None, 'n_estimators': 100, 'random_state': 1234}, {'max_depth': 2, 'n_estimators': 10, 'random_state': 1234}, {'max_depth': 2, 'n_estimators': 50, 'random_state': 1234}, {'max_depth': 2, 'n_estimators': 100, 'random_state': 1234}, {'max_depth': 3, 'n_estimators': 10, 'random_state': 1234}, {'max_depth': 3, 'n_estimators': 50, 'random_state': 1234}, {'max_depth': 3, 'n_estimators': 100, 'random_state': 1234}]

RMSE medio per combinazione:

[1858.5211656  1806.64242055 1770.62353266 2892.45112031 2868.41034669
 2875.19755245 2506.29438524 2488.8674007  2496.64787653]

Combinazione migliore:

{'max_depth': None, 'n_estimators': 100, 'random_state': 1234}

RMSE medio della combinazione migliore:

RMSE medio della combinazione migliore: 1770.624

What happens if we remove the outliers (z-score >3 or z-score <-3)?

The basic idea would be to be able to remove the outliers from the validation set and see if they actually worsen the R2 (although the R2 obtained so far is very good). But first...

It would be useful to count how many outliers are there per-columns

In [22]:
Q1 = df_mapped.quantile(0.25)
Q3 = df_mapped.quantile(0.75)
IQR=Q3-Q1
df_clean=df_mapped[~((df_mapped<(Q1-1.5*IQR))|(df_mapped>(Q3+1.5*IQR))).any(axis=1)]

cols = df_mapped.columns
all_z_scores=pd.DataFrame()

for col in cols:
    col_zscore = col + '_zscore'
    all_z_scores[col_zscore] = (df_mapped[col] - df_mapped[col].mean())/df_mapped[col].std(ddof=0)

all_outliers={}
arr=[]

for col in all_z_scores.columns:
    all_outliers[col+"_outlier"]=np.sum(all_z_scores[col]>3)+np.sum(all_z_scores[col]<-3)
    
outliers=pd.DataFrame(all_outliers, index=[0])
outliers=outliers.sort_values(by=0, axis=1, ascending=True)
px.histogram(y=outliers.columns, x=outliers.loc[0], title="Number of outliers by every column in the dataframe",labels={ # replaces default labels by column name
                "sum of x": "Number of outliers",  "y": "Column"
            })

Could the R2 score improve if we do testing on the dataset without the outliers?

We expected it to be a lot better but...

In [23]:
feature_cols = ['clarity', 'carat', 'cut', 'color','depth','table','x', 'y', 'z']
X_outliers = df_clean[feature_cols] # Features
y_outliers = df_clean.price.astype('float64') # Target variable
# split X and y into training and testing sets
valid_portion = 0.5
# Split random to train, test set
train_x_outliers, validation_x_outliers, train_y_outliers, validation_y_outliers = train_test_split(X_outliers, y_outliers, test_size=valid_portion, random_state=1)
    
print('R2 score:', rndfor_reg.score(validation_x_outliers, validation_y_outliers)) # R2 score migliora minimamente
R2 score: 0.9873933599457606

Non-Linear Regression

We realised that there were variables that had a correlation with the price that was not strictly linear.

Is there a non-linear correlation between price and other attributes?

In fact, we can see from this scatter that the variables x, y, z and depth do not have a linear trend.

In [24]:
allGraphs=[]

fig, ax = plt.subplots(1, 4, figsize=(70, 20))
ax[0].scatter(x = df_mapped["depth"], y = df_mapped["price"])
ax[0].set_xlabel("Depth")
ax[0].set_ylabel("Price")
ax[0].set_title('Correlation between the depth and the price')

ax[1].scatter(x =df_mapped["x"], y=df_mapped["price"])
ax[1].set_xlabel("X")
ax[1].set_ylabel("Price")
ax[1].set_title('Correlation between the x and the price')
                    
ax[2].scatter(x = df_mapped["y"], y = df_mapped["price"])
ax[2].set_xlabel("Y")
ax[2].set_ylabel("Price")
ax[2].set_title('Correlation between the y and the price')

ax[3].scatter(y =df_mapped["price"], x=df_mapped["z"])
ax[3].set_xlabel("Z")
ax[3].set_ylabel("Price")
ax[3].set_title('Correlation between the z and the price')

plt.show()
In [25]:
cols=0
def linear_regression(data, power, models_to_plot, col, fig):
    #initialize predictors:
    global cols
    predictors=[col]
    if power>=2:
        predictors.extend([col+'_{}'.format(i) for i in range(2,power+1)])
    
#     data=data.drop_duplicates(subset=[col], keep="first")
    
    #Fit the model
    linreg = LinearRegression(normalize=True)
    linreg.fit(data[predictors],data['price'])
    y_pred = linreg.predict(data[predictors])    
    
    #Check if a plot is to be made for the entered power
    if power in models_to_plot:
        cols=cols+1
        sampled=data.sample(n=500, random_state=1234)
        fig.add_trace(go.Scatter(x=sampled[col], y=sampled['price'], mode="markers", name="data"), row=1, col=cols)
        fig.add_trace(go.Scatter(x=sampled[col], y=pd.Series(y_pred).sample(n=500, random_state=1234), mode="markers", name="estimate"), row=1, col=cols)
    
    #Return the result in pre-defined format
    rss = sum((y_pred-data['price'])**2)
    ret = [rss]
    ret.extend([linreg.intercept_])
    ret.extend(linreg.coef_)
    return ret
In [26]:
df_copy=df_mapped.copy()
In [27]:
col_features = ['x', 'y', 'z']
for col_feature in col_features:
    for i in range(2,16):  #power of 1 is already there
        colname = col_feature+'_{}'.format(i)      #new var will be x_power
        df_copy[colname] = df_copy[col_feature]**i
        
    global cols
    cols=0
    #Initialize a dataframe to store the results:
    col = ['rss','intercept'] + ['coef'+col_feature+'_{}'.format(i) for i in range(1,16)]
    ind = ['model_pow_{}'.format(i) for i in range(1,16)]
    coef_matrix_simple = pd.DataFrame(index=ind, columns=col)

    #Define the powers for which a plot is required:
    models_to_plot = {2:231,6:233,15:236}
    
    fig = make_subplots(rows=1, cols=3, subplot_titles=("Power 2", "Power 6", "Power 15"))

    #Iterate through all powers and assimilate results
    for i in range(1,16):
        coef_matrix_simple.iloc[i-1,0:i+2] =linear_regression(df_copy, power=i, models_to_plot=models_to_plot,col=col_feature, fig=fig)
    fig.update_layout(height=600, width=800, title_text=f"Non Linear Regression for feature {col_feature}")
    
    for i in range(1, 4):
        fig['layout']['xaxis{}'.format(i)]['title']=col_feature
    
    fig['layout']['yaxis']['title']="price"
    fig.show()
    display(coef_matrix_simple)
rss intercept coefx_1 coefx_2 coefx_3 coefx_4 coefx_5 coefx_6 coefx_7 coefx_8 coefx_9 coefx_10 coefx_11 coefx_12 coefx_13 coefx_14 coefx_15
model_pow_1 1.86954e+11 -14094.1 3145.41 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_2 1.20902e+11 12791.9 -6248.31 790.246 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_3 1.20307e+11 19581.8 -9824.89 1400.99 -33.842 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_4 1.14774e+11 6093.92 4676.11 -3531.73 639.769 -32.1179 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_5 1.13448e+11 8084.5 -8979.33 4766.1 -1285.78 164.385 -7.4027 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_6 1.12663e+11 8493.84 -46806.2 34695.4 -10598.6 1585.36 -113.699 3.11932 NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_7 1.12099e+11 8402.28 74575 -80682.6 34297.6 -7566.85 917.228 -57.7299 1.47051 NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_8 1.12096e+11 8403.98 43451.9 -46212.3 18199.3 -3457.44 297.922 -2.62037 -1.21121 0.0550633 NaN NaN NaN NaN NaN NaN NaN
model_pow_9 1.11906e+11 8407.51 -1.01843e+06 1.29655e+06 -713604 221057 -42115.5 5050.18 -371.991 15.3813 -0.27329 NaN NaN NaN NaN NaN NaN
model_pow_10 1.11767e+11 8406.76 2.5786e+06 -3.81887e+06 2.47638e+06 -923872 218547 -33992.2 3475.83 -225.276 8.39597 -0.137075 NaN NaN NaN NaN NaN
model_pow_11 1.11755e+11 8406.71 7.1584e+06 -1.10499e+07 7.55135e+06 -3.00884e+06 773858 -134190 15881.5 -1266.13 65.0417 -1.94313 0.0256227 NaN NaN NaN NaN
model_pow_12 1.11655e+11 8406.74 -4.58044e+07 8.10766e+07 -6.44888e+07 3.04201e+07 -9.45448e+06 2.03271e+06 -308489 33046.7 -2448.98 119.579 -3.46275 0.0450584 NaN NaN NaN
model_pow_13 1.11603e+11 8406.72 -2.04872e+08 3.83476e+08 -3.25298e+08 1.65362e+08 -5.61042e+07 1.33852e+07 -2.30279e+06 287889 -25961.6 1647.3 -69.8264 1.77581 -0.020497 NaN NaN
model_pow_14 1.11564e+11 8407.05 3.14927e+08 -6.90073e+08 6.88417e+08 -4.14109e+08 1.67628e+08 -4.823e+07 1.01501e+07 -1.58237e+06 182786 -15456 930.157 -37.7379 0.925422 -0.0103624 NaN
model_pow_15 1.11371e+11 8413.28 4.65782e+09 -1.03678e+10 1.06121e+10 -6.6205e+09 2.81253e+09 -8.6073e+08 1.95704e+08 -3.35873e+07 4.37279e+06 -429826 31405.1 -1654.2 59.3882 -1.30104 0.0131243
rss intercept coefy_1 coefy_2 coefy_3 coefy_4 coefy_5 coefy_6 coefy_7 coefy_8 coefy_9 coefy_10 coefy_11 coefy_12 coefy_13 coefy_14 coefy_15
model_pow_1 2.15517e+11 -13402 3022.89 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_2 1.95775e+11 -15761.2 3730.24 -49.6423 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_3 1.7107e+11 -3046.15 -932.377 401.752 -6.56847 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_4 1.15337e+11 25612.7 -13051.9 1984.21 -73.8428 0.744498 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_5 1.14511e+11 16756.1 -7797.8 893.166 18.9312 -2.08403 0.0261961 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_6 1.10369e+11 6781.94 7065.94 -5315.7 1098.33 -84.2331 2.40393 -0.0214768 NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_7 1.07996e+11 8778.99 -19777.5 13007.4 -3805.65 546.816 -36.7429 0.999211 -0.00875111 NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_8 1.07722e+11 8959.64 -54186.4 42342.8 -13911.6 2342.4 -210.478 9.81874 -0.213587 0.00165007 NaN NaN NaN NaN NaN NaN NaN
model_pow_9 1.07158e+11 8892.97 140035 -154237 69156.4 -16633.9 2320.95 -188.268 8.39717 -0.179672 0.00138036 NaN NaN NaN NaN NaN NaN
model_pow_10 1.07082e+11 8899.21 -128693 160820 -88891.9 27647.1 -5257.08 624.202 -45.2727 1.87261 -0.0382606 0.000286291 NaN NaN NaN NaN NaN
model_pow_11 1.07035e+11 8900.87 -982181 1.29432e+06 -745044 244856 -50568.2 6805.79 -598.383 33.4616 -1.11613 0.019488 -0.000131664 NaN NaN NaN NaN
model_pow_12 1.06903e+11 8877.06 4.37695e+06 -6.68396e+06 4.5092e+06 -1.77035e+06 447870 -76293.7 8871.57 -697.712 35.8855 -1.12689 0.0188896 -0.000124388 NaN NaN NaN
model_pow_13 1.0689e+11 8892.16 4.11529e+06 -6.16471e+06 4.06192e+06 -1.54784e+06 376474 -60725.7 6510.45 -447.799 17.694 -0.248793 -0.00732201 0.000298761 -2.71633e-06 NaN NaN
model_pow_14 1.06889e+11 8898.52 3.77031e+06 -5.60271e+06 3.65337e+06 -1.37313e+06 327742 -51443 5284.22 -336.461 11.0162 -0.0107526 -0.010678 0.000233929 8.80656e-08 -2.50863e-08 NaN
model_pow_15 1.06889e+11 8905.4 3.60959e+06 -5.351e+06 3.4793e+06 -1.30343e+06 309987 -48485.3 4971.29 -318.565 10.995 -0.08791 -0.0051892 7.6984e-05 9.23872e-07 1.53993e-08 -5.1234e-10
rss intercept coefz_1 coefz_2 coefz_3 coefz_4 coefz_5 coefz_6 coefz_7 coefz_8 coefz_9 coefz_10 coefz_11 coefz_12 coefz_13 coefz_14 coefz_15
model_pow_1 2.217e+11 -13296.6 4868.79 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_2 2.12469e+11 -15219.1 5859.02 -121.473 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_3 1.3434e+11 14286.2 -11955.2 2742.02 -74.7947 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_4 1.34188e+11 16425 -13965.9 3372.18 -146.408 1.6894 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_5 1.25151e+11 8713.35 3935.36 -7838.93 2632.71 -266.181 6.00674 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_6 1.24998e+11 8294.8 8749.49 -12442.9 4325.85 -549.771 26.2688 -0.40503 NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_7 1.22158e+11 8961.33 -55923.3 67893.9 -35160.5 9110.38 -1184.15 69.6424 -1.26998 NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_8 1.22109e+11 8910.94 -34128 35097.2 -14913.5 2519.21 23.4768 -52.379 4.76907 -0.100942 NaN NaN NaN NaN NaN NaN NaN
model_pow_9 1.20793e+11 8767.25 170238 -346397 283291 -124429 32021 -4894.38 426.907 -18.8054 0.293652 NaN NaN NaN NaN NaN NaN
model_pow_10 1.20418e+11 8798.36 -35066.1 134116 -187720 130404 -51891.8 12498 -1826.25 154.596 -6.68229 0.103302 NaN NaN NaN NaN NaN
model_pow_11 1.2034e+11 8804.27 -266618 766943 -917169 601604 -241553 62236.5 -10412.7 1112.06 -71.8374 2.48176 -0.0331006 NaN NaN NaN NaN
model_pow_12 1.20218e+11 8798.76 455794 -1.45288e+06 2.00437e+06 -1.5904e+06 804615 -271785 62347.3 -9669.92 986.744 -62.3213 2.14008 -0.0285756 NaN NaN NaN
model_pow_13 1.19998e+11 8835.55 -1.40743e+06 4.98336e+06 -7.66462e+06 6.81182e+06 -3.91524e+06 1.53681e+06 -422511 81842.1 -11043.5 1005.49 -57.9273 1.85245 -0.0235894 NaN NaN
model_pow_14 1.20011e+11 8857.73 -1.22846e+06 4.30773e+06 -6.54385e+06 5.72507e+06 -3.2259e+06 1.23414e+06 -327788 60449.2 -7561.42 603.215 -26.0421 0.226651 0.023302 -0.000553999 NaN
model_pow_15 1.20006e+11 8845.52 -1.18408e+06 4.12849e+06 -6.22774e+06 5.4019e+06 -3.01179e+06 1.13703e+06 -296811 53449.4 -6453.16 484.777 -18.2718 0.00383851 0.0178136 2.80225e-06 -9.99544e-06

Ridge-Regression

From the correlation table we can see that the rss is so big because of the big values of the prices. Doing an error on huge numbers will make the rss a great number too.

In [28]:
def ridge_regression(data, predictors, alpha,col,fig, models_to_plot={}):
    #Fit the model
    global cols
    ridgereg = Ridge(alpha=alpha,normalize=True)
    ridgereg.fit(data[predictors],data['price'])
    y_pred = ridgereg.predict(data[predictors])
    
    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:        
        cols=cols+1
        sampled=data.sample(n=500, random_state=1234)
        fig.add_trace(go.Scatter(x=sampled[col], y=sampled['price'], mode="markers", name="data"), row=1, col=cols)
        fig.add_trace(go.Scatter(x=sampled[col], y=pd.Series(y_pred).sample(n=500, random_state=1234), mode="markers", name="estimate"), row=1, col=cols)
    
    #Return the result in pre-defined format
    rss = sum((y_pred-data['price'])**2)
    ret = [rss]
    ret.extend([ridgereg.intercept_])
    ret.extend(ridgereg.coef_)
    return ret
In [29]:
#Initialize predictors to be set of 15 powers of x
for col_feature in col_features:
    global cols
    cols=0
    predictors=[col_feature]
    predictors.extend([col_feature+'_{}'.format(i) for i in range(2,16)])

    #Set the different values of alpha to be tested
    alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]

    #Initialize the dataframe for storing coefficients.
    col = ['rss','intercept'] + ['coef_'+col_feature+'_{}'.format(i) for i in range(1,16)]
    ind = ['alpha_{}'.format(alpha_ridge[i]) for i in range(0,10)]
    coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)

    models_to_plot = {1e-15:231, 1e-10:232, 5:236}
    
    fig = make_subplots(rows=1, cols=3, subplot_titles=("Alpha 1e-15", "Alpha 1e-10", "Alpha 5"))
    for i in range(10):
        coef_matrix_ridge.iloc[i,] = ridge_regression(df_copy, predictors, alpha_ridge[i],col_feature,fig, models_to_plot)
        
    fig.update_layout(height=600, width=800, title_text=f"Ridge Regression for feature {col_feature}")
    for i in range(1, 4):
        fig['layout']['xaxis{}'.format(i)]['title']=col_feature
    
    fig['layout']['yaxis']['title']="price"
    fig.show()
    display(coef_matrix_simple)
rss intercept coefz_1 coefz_2 coefz_3 coefz_4 coefz_5 coefz_6 coefz_7 coefz_8 coefz_9 coefz_10 coefz_11 coefz_12 coefz_13 coefz_14 coefz_15
model_pow_1 2.217e+11 -13296.6 4868.79 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_2 2.12469e+11 -15219.1 5859.02 -121.473 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_3 1.3434e+11 14286.2 -11955.2 2742.02 -74.7947 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_4 1.34188e+11 16425 -13965.9 3372.18 -146.408 1.6894 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_5 1.25151e+11 8713.35 3935.36 -7838.93 2632.71 -266.181 6.00674 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_6 1.24998e+11 8294.8 8749.49 -12442.9 4325.85 -549.771 26.2688 -0.40503 NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_7 1.22158e+11 8961.33 -55923.3 67893.9 -35160.5 9110.38 -1184.15 69.6424 -1.26998 NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_8 1.22109e+11 8910.94 -34128 35097.2 -14913.5 2519.21 23.4768 -52.379 4.76907 -0.100942 NaN NaN NaN NaN NaN NaN NaN
model_pow_9 1.20793e+11 8767.25 170238 -346397 283291 -124429 32021 -4894.38 426.907 -18.8054 0.293652 NaN NaN NaN NaN NaN NaN
model_pow_10 1.20418e+11 8798.36 -35066.1 134116 -187720 130404 -51891.8 12498 -1826.25 154.596 -6.68229 0.103302 NaN NaN NaN NaN NaN
model_pow_11 1.2034e+11 8804.27 -266618 766943 -917169 601604 -241553 62236.5 -10412.7 1112.06 -71.8374 2.48176 -0.0331006 NaN NaN NaN NaN
model_pow_12 1.20218e+11 8798.76 455794 -1.45288e+06 2.00437e+06 -1.5904e+06 804615 -271785 62347.3 -9669.92 986.744 -62.3213 2.14008 -0.0285756 NaN NaN NaN
model_pow_13 1.19998e+11 8835.55 -1.40743e+06 4.98336e+06 -7.66462e+06 6.81182e+06 -3.91524e+06 1.53681e+06 -422511 81842.1 -11043.5 1005.49 -57.9273 1.85245 -0.0235894 NaN NaN
model_pow_14 1.20011e+11 8857.73 -1.22846e+06 4.30773e+06 -6.54385e+06 5.72507e+06 -3.2259e+06 1.23414e+06 -327788 60449.2 -7561.42 603.215 -26.0421 0.226651 0.023302 -0.000553999 NaN
model_pow_15 1.20006e+11 8845.52 -1.18408e+06 4.12849e+06 -6.22774e+06 5.4019e+06 -3.01179e+06 1.13703e+06 -296811 53449.4 -6453.16 484.777 -18.2718 0.00383851 0.0178136 2.80225e-06 -9.99544e-06
rss intercept coefz_1 coefz_2 coefz_3 coefz_4 coefz_5 coefz_6 coefz_7 coefz_8 coefz_9 coefz_10 coefz_11 coefz_12 coefz_13 coefz_14 coefz_15
model_pow_1 2.217e+11 -13296.6 4868.79 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_2 2.12469e+11 -15219.1 5859.02 -121.473 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_3 1.3434e+11 14286.2 -11955.2 2742.02 -74.7947 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_4 1.34188e+11 16425 -13965.9 3372.18 -146.408 1.6894 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_5 1.25151e+11 8713.35 3935.36 -7838.93 2632.71 -266.181 6.00674 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_6 1.24998e+11 8294.8 8749.49 -12442.9 4325.85 -549.771 26.2688 -0.40503 NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_7 1.22158e+11 8961.33 -55923.3 67893.9 -35160.5 9110.38 -1184.15 69.6424 -1.26998 NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_8 1.22109e+11 8910.94 -34128 35097.2 -14913.5 2519.21 23.4768 -52.379 4.76907 -0.100942 NaN NaN NaN NaN NaN NaN NaN
model_pow_9 1.20793e+11 8767.25 170238 -346397 283291 -124429 32021 -4894.38 426.907 -18.8054 0.293652 NaN NaN NaN NaN NaN NaN
model_pow_10 1.20418e+11 8798.36 -35066.1 134116 -187720 130404 -51891.8 12498 -1826.25 154.596 -6.68229 0.103302 NaN NaN NaN NaN NaN
model_pow_11 1.2034e+11 8804.27 -266618 766943 -917169 601604 -241553 62236.5 -10412.7 1112.06 -71.8374 2.48176 -0.0331006 NaN NaN NaN NaN
model_pow_12 1.20218e+11 8798.76 455794 -1.45288e+06 2.00437e+06 -1.5904e+06 804615 -271785 62347.3 -9669.92 986.744 -62.3213 2.14008 -0.0285756 NaN NaN NaN
model_pow_13 1.19998e+11 8835.55 -1.40743e+06 4.98336e+06 -7.66462e+06 6.81182e+06 -3.91524e+06 1.53681e+06 -422511 81842.1 -11043.5 1005.49 -57.9273 1.85245 -0.0235894 NaN NaN
model_pow_14 1.20011e+11 8857.73 -1.22846e+06 4.30773e+06 -6.54385e+06 5.72507e+06 -3.2259e+06 1.23414e+06 -327788 60449.2 -7561.42 603.215 -26.0421 0.226651 0.023302 -0.000553999 NaN
model_pow_15 1.20006e+11 8845.52 -1.18408e+06 4.12849e+06 -6.22774e+06 5.4019e+06 -3.01179e+06 1.13703e+06 -296811 53449.4 -6453.16 484.777 -18.2718 0.00383851 0.0178136 2.80225e-06 -9.99544e-06
rss intercept coefz_1 coefz_2 coefz_3 coefz_4 coefz_5 coefz_6 coefz_7 coefz_8 coefz_9 coefz_10 coefz_11 coefz_12 coefz_13 coefz_14 coefz_15
model_pow_1 2.217e+11 -13296.6 4868.79 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_2 2.12469e+11 -15219.1 5859.02 -121.473 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_3 1.3434e+11 14286.2 -11955.2 2742.02 -74.7947 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_4 1.34188e+11 16425 -13965.9 3372.18 -146.408 1.6894 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_5 1.25151e+11 8713.35 3935.36 -7838.93 2632.71 -266.181 6.00674 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_6 1.24998e+11 8294.8 8749.49 -12442.9 4325.85 -549.771 26.2688 -0.40503 NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_7 1.22158e+11 8961.33 -55923.3 67893.9 -35160.5 9110.38 -1184.15 69.6424 -1.26998 NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_8 1.22109e+11 8910.94 -34128 35097.2 -14913.5 2519.21 23.4768 -52.379 4.76907 -0.100942 NaN NaN NaN NaN NaN NaN NaN
model_pow_9 1.20793e+11 8767.25 170238 -346397 283291 -124429 32021 -4894.38 426.907 -18.8054 0.293652 NaN NaN NaN NaN NaN NaN
model_pow_10 1.20418e+11 8798.36 -35066.1 134116 -187720 130404 -51891.8 12498 -1826.25 154.596 -6.68229 0.103302 NaN NaN NaN NaN NaN
model_pow_11 1.2034e+11 8804.27 -266618 766943 -917169 601604 -241553 62236.5 -10412.7 1112.06 -71.8374 2.48176 -0.0331006 NaN NaN NaN NaN
model_pow_12 1.20218e+11 8798.76 455794 -1.45288e+06 2.00437e+06 -1.5904e+06 804615 -271785 62347.3 -9669.92 986.744 -62.3213 2.14008 -0.0285756 NaN NaN NaN
model_pow_13 1.19998e+11 8835.55 -1.40743e+06 4.98336e+06 -7.66462e+06 6.81182e+06 -3.91524e+06 1.53681e+06 -422511 81842.1 -11043.5 1005.49 -57.9273 1.85245 -0.0235894 NaN NaN
model_pow_14 1.20011e+11 8857.73 -1.22846e+06 4.30773e+06 -6.54385e+06 5.72507e+06 -3.2259e+06 1.23414e+06 -327788 60449.2 -7561.42 603.215 -26.0421 0.226651 0.023302 -0.000553999 NaN
model_pow_15 1.20006e+11 8845.52 -1.18408e+06 4.12849e+06 -6.22774e+06 5.4019e+06 -3.01179e+06 1.13703e+06 -296811 53449.4 -6453.16 484.777 -18.2718 0.00383851 0.0178136 2.80225e-06 -9.99544e-06

Lasso

Do Lasso to check the columns that infers more

In [30]:
def lasso_regression(data, predictors, alpha,col,fig, models_to_plot={}):
    global cols
    #Fit the model
    lassoreg = Lasso(alpha=alpha,normalize=True, max_iter=1e2)
    lassoreg.fit(data[predictors],data['price'])
    y_pred = lassoreg.predict(data[predictors])
    
    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        cols=cols+1
        sampled=data.sample(n=500, random_state=1234)
        fig.add_trace(go.Scatter(x=sampled[col], y=sampled['price'], mode="markers", name="data"), row=1, col=cols)
        fig.add_trace(go.Scatter(x=sampled[col], y=pd.Series(y_pred).sample(n=500, random_state=1234), mode="markers", name="estimate"), row=1, col=cols)
    
    #Return the result in pre-defined format
    rss = sum((y_pred-data['price'])**2)
    ret = [rss]
    ret.extend([lassoreg.intercept_])
    ret.extend(lassoreg.coef_)
    return ret
In [31]:
#Initialize predictors to be set of 15 powers of x
for col_feature in col_features:
    global cols
    cols=0
    predictors=[col_feature]
    predictors.extend([col_feature+'_{}'.format(i) for i in range(2,16)])

    #Set the different values of alpha to be tested
    alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]

    #Initialize the dataframe for storing coefficients.
    col = ['rss','intercept'] + ['coef_'+col_feature+'_{}'.format(i) for i in range(1,16)]
    ind = ['alpha_{}'.format(alpha_ridge[i]) for i in range(0,10)]
    coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)

    models_to_plot = {1e-15:231, 1e-10:232, 5:236}
    
    fig = make_subplots(rows=1, cols=3, subplot_titles=("Alpha 1e-15", "Alpha 1e-10", "Alpha 5"))
    for i in range(10):
        coef_matrix_ridge.iloc[i,] = lasso_regression(df_copy, predictors, alpha_ridge[i],col_feature,fig, models_to_plot)
        
    fig.update_layout(height=600, width=800, title_text=f"Ridge Regression for feature {col_feature}")
    for i in range(1, 4):
        fig['layout']['xaxis{}'.format(i)]['title']=col_feature
    
    fig['layout']['yaxis']['title']="price"
    fig.show()
    display(coef_matrix_simple)
C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 57651714885.70906, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 57651714755.14087, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 57651701828.757706, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 57520961678.93393, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 56327857082.33457, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 42744663693.95874, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 628430506.3542633, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 357646514.3161011, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 105515872.39941406, tolerance: 85847313.55173963

rss intercept coefz_1 coefz_2 coefz_3 coefz_4 coefz_5 coefz_6 coefz_7 coefz_8 coefz_9 coefz_10 coefz_11 coefz_12 coefz_13 coefz_14 coefz_15
model_pow_1 2.217e+11 -13296.6 4868.79 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_2 2.12469e+11 -15219.1 5859.02 -121.473 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_3 1.3434e+11 14286.2 -11955.2 2742.02 -74.7947 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_4 1.34188e+11 16425 -13965.9 3372.18 -146.408 1.6894 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_5 1.25151e+11 8713.35 3935.36 -7838.93 2632.71 -266.181 6.00674 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_6 1.24998e+11 8294.8 8749.49 -12442.9 4325.85 -549.771 26.2688 -0.40503 NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_7 1.22158e+11 8961.33 -55923.3 67893.9 -35160.5 9110.38 -1184.15 69.6424 -1.26998 NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_8 1.22109e+11 8910.94 -34128 35097.2 -14913.5 2519.21 23.4768 -52.379 4.76907 -0.100942 NaN NaN NaN NaN NaN NaN NaN
model_pow_9 1.20793e+11 8767.25 170238 -346397 283291 -124429 32021 -4894.38 426.907 -18.8054 0.293652 NaN NaN NaN NaN NaN NaN
model_pow_10 1.20418e+11 8798.36 -35066.1 134116 -187720 130404 -51891.8 12498 -1826.25 154.596 -6.68229 0.103302 NaN NaN NaN NaN NaN
model_pow_11 1.2034e+11 8804.27 -266618 766943 -917169 601604 -241553 62236.5 -10412.7 1112.06 -71.8374 2.48176 -0.0331006 NaN NaN NaN NaN
model_pow_12 1.20218e+11 8798.76 455794 -1.45288e+06 2.00437e+06 -1.5904e+06 804615 -271785 62347.3 -9669.92 986.744 -62.3213 2.14008 -0.0285756 NaN NaN NaN
model_pow_13 1.19998e+11 8835.55 -1.40743e+06 4.98336e+06 -7.66462e+06 6.81182e+06 -3.91524e+06 1.53681e+06 -422511 81842.1 -11043.5 1005.49 -57.9273 1.85245 -0.0235894 NaN NaN
model_pow_14 1.20011e+11 8857.73 -1.22846e+06 4.30773e+06 -6.54385e+06 5.72507e+06 -3.2259e+06 1.23414e+06 -327788 60449.2 -7561.42 603.215 -26.0421 0.226651 0.023302 -0.000553999 NaN
model_pow_15 1.20006e+11 8845.52 -1.18408e+06 4.12849e+06 -6.22774e+06 5.4019e+06 -3.01179e+06 1.13703e+06 -296811 53449.4 -6453.16 484.777 -18.2718 0.00383851 0.0178136 2.80225e-06 -9.99544e-06
C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 87663854888.01605, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 87663854679.23033, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 87663834009.23819, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 87455262081.28275, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 85595374398.21869, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 68636813011.58795, tolerance: 85847313.55173963

rss intercept coefz_1 coefz_2 coefz_3 coefz_4 coefz_5 coefz_6 coefz_7 coefz_8 coefz_9 coefz_10 coefz_11 coefz_12 coefz_13 coefz_14 coefz_15
model_pow_1 2.217e+11 -13296.6 4868.79 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_2 2.12469e+11 -15219.1 5859.02 -121.473 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_3 1.3434e+11 14286.2 -11955.2 2742.02 -74.7947 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_4 1.34188e+11 16425 -13965.9 3372.18 -146.408 1.6894 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_5 1.25151e+11 8713.35 3935.36 -7838.93 2632.71 -266.181 6.00674 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_6 1.24998e+11 8294.8 8749.49 -12442.9 4325.85 -549.771 26.2688 -0.40503 NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_7 1.22158e+11 8961.33 -55923.3 67893.9 -35160.5 9110.38 -1184.15 69.6424 -1.26998 NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_8 1.22109e+11 8910.94 -34128 35097.2 -14913.5 2519.21 23.4768 -52.379 4.76907 -0.100942 NaN NaN NaN NaN NaN NaN NaN
model_pow_9 1.20793e+11 8767.25 170238 -346397 283291 -124429 32021 -4894.38 426.907 -18.8054 0.293652 NaN NaN NaN NaN NaN NaN
model_pow_10 1.20418e+11 8798.36 -35066.1 134116 -187720 130404 -51891.8 12498 -1826.25 154.596 -6.68229 0.103302 NaN NaN NaN NaN NaN
model_pow_11 1.2034e+11 8804.27 -266618 766943 -917169 601604 -241553 62236.5 -10412.7 1112.06 -71.8374 2.48176 -0.0331006 NaN NaN NaN NaN
model_pow_12 1.20218e+11 8798.76 455794 -1.45288e+06 2.00437e+06 -1.5904e+06 804615 -271785 62347.3 -9669.92 986.744 -62.3213 2.14008 -0.0285756 NaN NaN NaN
model_pow_13 1.19998e+11 8835.55 -1.40743e+06 4.98336e+06 -7.66462e+06 6.81182e+06 -3.91524e+06 1.53681e+06 -422511 81842.1 -11043.5 1005.49 -57.9273 1.85245 -0.0235894 NaN NaN
model_pow_14 1.20011e+11 8857.73 -1.22846e+06 4.30773e+06 -6.54385e+06 5.72507e+06 -3.2259e+06 1.23414e+06 -327788 60449.2 -7561.42 603.215 -26.0421 0.226651 0.023302 -0.000553999 NaN
model_pow_15 1.20006e+11 8845.52 -1.18408e+06 4.12849e+06 -6.22774e+06 5.4019e+06 -3.01179e+06 1.13703e+06 -296811 53449.4 -6453.16 484.777 -18.2718 0.00383851 0.0178136 2.80225e-06 -9.99544e-06
C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 76643982098.60341, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 76643982015.29265, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 76643974154.57953, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 76568138233.23845, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 75885934365.49455, tolerance: 85847313.55173963

C:\Users\MohamedBoutaleb\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:529: ConvergenceWarning:

Objective did not converge. You might want to increase the number of iterations. Duality gap: 69099137304.37404, tolerance: 85847313.55173963

rss intercept coefz_1 coefz_2 coefz_3 coefz_4 coefz_5 coefz_6 coefz_7 coefz_8 coefz_9 coefz_10 coefz_11 coefz_12 coefz_13 coefz_14 coefz_15
model_pow_1 2.217e+11 -13296.6 4868.79 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_2 2.12469e+11 -15219.1 5859.02 -121.473 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_3 1.3434e+11 14286.2 -11955.2 2742.02 -74.7947 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_4 1.34188e+11 16425 -13965.9 3372.18 -146.408 1.6894 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_5 1.25151e+11 8713.35 3935.36 -7838.93 2632.71 -266.181 6.00674 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_6 1.24998e+11 8294.8 8749.49 -12442.9 4325.85 -549.771 26.2688 -0.40503 NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_7 1.22158e+11 8961.33 -55923.3 67893.9 -35160.5 9110.38 -1184.15 69.6424 -1.26998 NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_8 1.22109e+11 8910.94 -34128 35097.2 -14913.5 2519.21 23.4768 -52.379 4.76907 -0.100942 NaN NaN NaN NaN NaN NaN NaN
model_pow_9 1.20793e+11 8767.25 170238 -346397 283291 -124429 32021 -4894.38 426.907 -18.8054 0.293652 NaN NaN NaN NaN NaN NaN
model_pow_10 1.20418e+11 8798.36 -35066.1 134116 -187720 130404 -51891.8 12498 -1826.25 154.596 -6.68229 0.103302 NaN NaN NaN NaN NaN
model_pow_11 1.2034e+11 8804.27 -266618 766943 -917169 601604 -241553 62236.5 -10412.7 1112.06 -71.8374 2.48176 -0.0331006 NaN NaN NaN NaN
model_pow_12 1.20218e+11 8798.76 455794 -1.45288e+06 2.00437e+06 -1.5904e+06 804615 -271785 62347.3 -9669.92 986.744 -62.3213 2.14008 -0.0285756 NaN NaN NaN
model_pow_13 1.19998e+11 8835.55 -1.40743e+06 4.98336e+06 -7.66462e+06 6.81182e+06 -3.91524e+06 1.53681e+06 -422511 81842.1 -11043.5 1005.49 -57.9273 1.85245 -0.0235894 NaN NaN
model_pow_14 1.20011e+11 8857.73 -1.22846e+06 4.30773e+06 -6.54385e+06 5.72507e+06 -3.2259e+06 1.23414e+06 -327788 60449.2 -7561.42 603.215 -26.0421 0.226651 0.023302 -0.000553999 NaN
model_pow_15 1.20006e+11 8845.52 -1.18408e+06 4.12849e+06 -6.22774e+06 5.4019e+06 -3.01179e+06 1.13703e+06 -296811 53449.4 -6453.16 484.777 -18.2718 0.00383851 0.0178136 2.80225e-06 -9.99544e-06
In [32]:
def lasso_regression(data, predictors, alpha, models_to_plot={}):
    #Fit the model
    lassoreg = Lasso(alpha=alpha,normalize=True, max_iter=1e5)
    lassoreg.fit(data[predictors],data['price'])
    y_pred = lassoreg.predict(data[predictors])
    
    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data[predictors],y_pred)
        plt.plot(data[predictors],data['price'],'.')
        plt.title('Plot for alpha: {}'.format(alpha))
    
    #Return the result in pre-defined format
    rss = sum((y_pred-data['price'])**2)
    ret = [rss]
    ret.extend([lassoreg.intercept_])
    ret.extend(lassoreg.coef_)
    return ret

As mentioned above, the variable with the most importance seems to be carat. In fact, from the table we can see that the feature that is never sent to 0 is x, even though the correlation between price and carat is more significant than the correlation between price and feature x. (0.92 control 0.88 respectively).

In [33]:
#Initialize predictors to all 15 powers of x
predictors=["cut", "table", "depth", "color", "clarity", "x", "y", "z"]

#Define the alpha values to test
alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5,1e-4, 1e-3,1e-2, 1, 5, 10]

#Initialize the dataframe to store coefficients
col = ['rss','intercept'] + ['coef_{}'.format(i) for i in predictors]
ind = ['alpha_{}'.format(alpha_lasso[i]) for i in range(0,10)]
coef_matrix_lasso = pd.DataFrame(index=ind, columns=col)

#Define the models to plot
models_to_plot = {}

#Iterate over the 10 alpha values:
for i in range(10):
    coef_matrix_lasso.iloc[i,] = lasso_regression(df_mapped, predictors, alpha_lasso[i], models_to_plot)
display(coef_matrix_lasso)
rss intercept coef_cut coef_table coef_depth coef_color coef_clarity coef_x coef_y coef_z
alpha_1e-15 1.4342e+11 -28551.5 130.644 -1.90243 117.747 247.12 543.245 3373.59 180.224 19.4235
alpha_1e-10 1.4342e+11 -28551.5 130.644 -1.90243 117.747 247.12 543.245 3373.59 180.224 19.4235
alpha_1e-08 1.4342e+11 -28551.5 130.644 -1.90243 117.747 247.12 543.245 3373.59 180.224 19.4235
alpha_1e-05 1.4342e+11 -28551.7 130.642 -1.902 117.75 247.119 543.243 3373.58 180.275 19.3568
alpha_0.0001 1.4342e+11 -28554.5 130.621 -1.90215 117.8 247.106 543.232 3374.04 180.493 18.2406
alpha_0.001 1.4342e+11 -28572.4 130.382 -1.92465 118.155 246.971 543.123 3380.53 179.69 8.8834
alpha_0.01 1.43422e+11 -28369.9 128.227 -1.73562 115.61 244.858 540.881 3373.49 177.978 16.0154
alpha_1 1.61603e+11 -15725.1 0 -0 0 20.0376 290.918 2930.51 86.059 147.103
alpha_5 2.59692e+11 -8161.09 0 0 0 -0 0 2110.2 0 0
alpha_10 4.77906e+11 -2228.13 -0 0 -0 -0 -0 1074.99 0 0

Classification

In this section we will have an overview regarding the classification of our dataset. There will be focused all the cathegorical variables ('color', 'cut' and 'clarity').
We can start with a multiple plot created with Seaborn. In this case we can see the variety of possible classification that Seaborn can realize.

In [34]:
plt.figure(figsize=(10,8), dpi= 80)
sns.pairplot(df, kind="scatter", hue="color", plot_kws=dict(s=80, edgecolor="white", linewidth=1.0))
plt.show()
<Figure size 800x640 with 0 Axes>

K-Nearest Neighbors

In our project we compute neighbours basing on euclidean distance.

In [35]:
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)
In [36]:
# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
    distances = list()
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    distances.sort(key=lambda tup: tup[1])
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances[i][0])
    return neighbors


# neighbors = get_neighbors(dataset, dataset[0], 3)
# for neighbor in neighbors:
#     print(neighbor)
In [37]:
# Make a classification prediction with neighbors
def predict_classification(train, test_row, num_neighbors):
    neighbors = get_neighbors(train, test_row, num_neighbors)
    output_values = [row[-1] for row in neighbors]
    prediction = max(set(output_values), key=output_values.count)
    return prediction

# prediction = predict_classification(df, dataset[0], 3)
# print('Expected %d, Got %d.' % (df[0][-1], prediction))
In [38]:
def set_df_last_column(df, column_name):
    df_col=df.drop(column_name, axis=1)
    df_col[column_name]=df[column_name]
    return df_col
In [39]:
feature_clas_cols=["cut", "color", "clarity"]

for i in feature_clas_cols:
    df_color=set_df_last_column(df_mapped, i)

    X = df_color.iloc[:, :-1].values
    y = df_color.iloc[:, len(df_mapped.columns)-1].values

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

    scaler = StandardScaler()
    scaler.fit(X_train)

    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    classifier = KNeighborsClassifier(n_neighbors=5)
    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)

    printmd(f"**Printing the confusione matrix for {i}**")
    
    print(confusion_matrix(y_test, y_pred))
    classification_report(y_test, y_pred)


    error = []

    # Calculating error for K values between 1 and 40
    for i in range(1, 20):
        knn = KNeighborsClassifier(n_neighbors=i)
        knn.fit(X_train, y_train)
        pred_i = knn.predict(X_test)
        error.append(np.mean(pred_i != y_test))

    plt.figure(figsize=(12, 6))
    plt.plot(range(1, 20), error, color='red', linestyle='dashed', marker='o',
             markerfacecolor='blue', markersize=10)
    plt.title('Error Rate K Value')
    plt.xlabel('K Value')
    plt.ylabel('Mean Error')

    plt.show()

    # plt.savefig("knn_color.png", dpi=300, bbox_inches='tight')

Printing the confusione matrix for cut

[[ 230   82    5    6    5]
 [  33  535  318   90   34]
 [   2  282  829  746  553]
 [   0   57  623 1770  268]
 [   2   19  319  229 3751]]

Printing the confusione matrix for color

[[205 147  95  65  29  30  11]
 [120 368 262 144  84  63  49]
 [ 77 249 553 320 168 152  97]
 [ 63 181 392 906 337 245  91]
 [ 35 115 279 498 521 348 142]
 [ 37  99 227 400 398 576 211]
 [ 27  64 137 241 274 350 306]]

Printing the confusione matrix for clarity

[[  45   80   14    7    0    0    0    0]
 [  25 1040  559  153   47    9   10    1]
 [  13  619 1242  498  163   49   25    7]
 [  10  223  724 1080  270  112   50    7]
 [   3  110  347  516  435  126   64   13]
 [   0   46  158  221  210  244   69   22]
 [   1   24  121  159  137  153  134   28]
 [   0    5   37   55   58   78   70   62]]

SVM

We implemented a simple method that taking a dataframe and a feature name as input will return a new dataframe with the target value (specified as parameter) in the last column. This is made to use subsequently the method getDatasetPatternsAndLabels().
As made before, we split test and validation; Subsequently, we consider a sample of 1000 elements and using as feature columns 'table' and 'x' and as target 'color'. In this way we are going to train 3 different classifiers (KNN, linear SVM and radial-basis-function SVM) to compare the final results. We used a k-value of 5 for KNN and for each classifier we split in train and test with a 0.4 ratio.

In [40]:
def getDatasetPatternsAndLabels(df, featureCount):
    patterns=[]
    labels=df.iloc[:, -1:].values.reshape(1,-1)[0]
    
    for i, row in df.iterrows():
        currentRow=[]
        for columnName in range(featureCount):
            currentRow.append(row[df.columns[columnName]])
        patterns.append(currentRow)
    
    return pd.DataFrame(patterns), pd.DataFrame(labels)[0]
In [41]:
feature_count = 5

dfCut = set_df_last_column(df_mapped, "clarity")
dataset_patterns, dataset_labels = getDatasetPatternsAndLabels(dfCut, feature_count)
In [42]:
train_x, validation_x, train_y, validation_y = train_test_split(dataset_patterns, dataset_labels, test_size=0.25)
print('Shape training set:', train_x.shape)
print('Shape validation set:', validation_x.shape)
Shape training set: (40455, 5)
Shape validation set: (13485, 5)
In [43]:
# Uso di show_results
# ml_visualization.show_2D_results(clf, (train_x, train_y, 'Training'), (validation_x, validation_y, 'Validation'))
In [44]:
# separate array into input and output components

df_sampled=df_mapped.head(n=1000)
#split dataset in features and target variable
feature_cols = ['table', 'x']
X = df_sampled[feature_cols] # Features
y = df_sampled.color.astype('float64') # Target variable

scaler = StandardScaler().fit(X)
X = scaler.transform(X)
# summarize transformed data
np.set_printoptions(precision=3)

train_x, validation_x, train_y, validation_y = train_test_split(X, y, test_size=0.40)
print('Shape dataset:', train_x.shape)
print('Shape labels:', validation_x.shape)
# train_x=train_x.values
# validation_x=validation_x.values
train_y=train_y.values
validation_y=validation_y.values
# validation_x
# train_y
train_x
Shape dataset: (600, 2)
Shape labels: (400, 2)
Out[44]:
array([[ 0.108,  0.295],
       [ 0.918,  0.199],
       [-0.419, -0.026],
       ...,
       [ 0.108,  0.615],
       [-0.298,  0.487],
       [-1.514,  0.183]])
In [45]:
# Creazione del classificatore: soluzione
# Di seguito รจ riportata la procedura corretta per creare i suddetti classificatori.

clf_1 = KNeighborsClassifier(n_neighbors=5)
clf_2 = SVC(kernel="linear", C=1)
clf_3 = SVC(kernel="rbf", C=1, gamma=0.1)

printmd("**KNN**")
# Addestramento del classificatore: soluzione
clf_1.fit(train_x, train_y)
# Calcolo accuratezza
print("Accuratezza sul training set con KNN: {}".format(clf_1.score(train_x, train_y)))
print("Accuratezza sul validation set con  KNN: {}".format(clf_1.score(validation_x, validation_y)))

printmd("**SVM lineare**")
# Addestramento del classificatore: soluzione
clf_2.fit(train_x, train_y)
# Calcolo accuratezza
print("Accuratezza sul training set con SVM con kernel lineare: {}".format(clf_2.score(train_x, train_y)))
print("Accuratezza sul validation set con SVM con kernel lineare: {}".format(clf_2.score(validation_x, validation_y)))

printmd("**SVM rbf**")
# Addestramento del classificatore: soluzione
clf_3.fit(train_x, train_y)
# Calcolo accuratezza
print("Accuratezza sul training set con SVM con kernel rbf: {}".format(clf_3.score(train_x, train_y)))
print("Accuratezza sul validation set con SVM con kernel rbf: {}".format(clf_3.score(validation_x, validation_y)))

#Visualizzazione 2D
# For some reason this plot is not working
# ml_visualization.show_2D_results(clf, (train_x, train_y, 'Training'), (validation_x, validation_y, 'Validation'))

KNN

Accuratezza sul training set con KNN: 0.48
Accuratezza sul validation set con  KNN: 0.235

SVM lineare

Accuratezza sul training set con SVM con kernel lineare: 0.24
Accuratezza sul validation set con SVM con kernel lineare: 0.23

SVM rbf

Accuratezza sul training set con SVM con kernel rbf: 0.25166666666666665
Accuratezza sul validation set con SVM con kernel rbf: 0.2175

Bonus: how much should we spend on a diamond according to its characteristics?

Now, we would like to sum up with a triplet of plot representing the mean price trend in relation to clarity, cut and color. In this we a person could approssimate if it is worth to buy a considered diamond with certain features.

In [46]:
def display_price(df, age = (1,7), price = (100,100000), vehicle_type = "all", state = "all"):
    # Display the median price of vehicles depending on its type and its state.

    df = df[(df[vehicle_type] <= age[1]) & (df[vehicle_type] >= age[0])]

    df = df[(df["price"] >= price[0]) & (df["price"] <= price[1])]

    price_age = pd.pivot_table(df, values = "price", index = vehicle_type, aggfunc= np.median)
    price_age.columns = ["Median Price"]

    fig = plt.figure(figsize=(12,6))
    ax = fig.add_axes([0,0,1,1])
    ax2 = fig.add_axes([0.6,0.47,.35,.35])

    ax.plot(price_age["Median Price"], lw = 5)

    ax2.set_title(f"Column type: {vehicle_type}\nNumber of diamonds: {df.shape[0]}", fontsize = 15)
    ax2.axis('off')

    ax.set_title(f"Median price by {vehicle_type} of the diamonds",fontsize=25)
    ax.set_ylim(0,price_age["Median Price"].max()+1000)
    ax.set_xlabel(vehicle_type, fontsize = 15)
    ax.set_ylabel("Median price in $", fontsize = 15)

    ax.tick_params(axis='both', which='major', labelsize=15) 

#     plt.savefig(f"medians{vehicle_type}.png",dpi=300, bbox_inches='tight')
In [47]:
display_price(df_mapped, vehicle_type="color")
display_price(df_mapped,age=(1, 5), vehicle_type="cut")
display_price(df_mapped,age=(3, 10), vehicle_type="clarity")